added healthcheck support for agent #820

aritrbas · 2025-10-21T22:33:16Z

Summary

This adds HTTP-based health check endpoints for the Calico VPP agent, replacing the existing restart-on-timeout behavior with Kubernetes readiness and liveness probes.

Previously, the agent container would restart frequently while waiting for Felix configuration updates. Pods could appear Running even when not fully initialized, which made it difficult to distinguish initialization delays from actual failures.

Now, we report initialization status through standard Kubernetes probes, keeping the container running during initialization by marking it as Not Ready. This allows Kubernetes to manage pod lifecycle based on health check status.

Changes

1. New Health Package (`calico-vpp-agent/health/`)

Created a new package with:

health.go: HTTP server with three endpoints:
- /liveness: Basic health status (for liveness probe)
- /readiness: Initialization status (for readiness probe)
- /status: Detailed JSON status (for monitoring/debugging)
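
For readers who want a feel for how the three endpoints relate, here is a minimal sketch. It is not the code in `health.go`: the `healthState` type, the helper names, and the choice of 503 for a failing probe are assumptions made for illustration. It relies only on the fact that the kubelet treats any HTTP status from 200 up to, but not including, 400 as a probe success.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthState is a hypothetical stand-in for the state kept by the health package.
type healthState struct {
	healthy atomic.Bool // process is alive and able to serve
	ready   atomic.Bool // initialization has completed
}

// newHealthMux wires the three endpoints named in the description.
func newHealthMux(s *healthState) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/liveness", func(w http.ResponseWriter, _ *http.Request) {
		writeProbe(w, s.healthy.Load())
	})
	mux.HandleFunc("/readiness", func(w http.ResponseWriter, _ *http.Request) {
		writeProbe(w, s.healthy.Load() && s.ready.Load())
	})
	mux.HandleFunc("/status", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]bool{
			"healthy": s.healthy.Load(),
			"ready":   s.ready.Load(),
		})
	})
	return mux
}

// writeProbe answers 200 when ok is true and 503 otherwise.
func writeProbe(w http.ResponseWriter, ok bool) {
	if !ok {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe performs an in-process GET and returns the response status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &healthState{}
	s.healthy.Store(true)
	mux := newHealthMux(s)
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness")) // 200 503
	s.ready.Store(true)
	fmt.Println(probe(mux, "/readiness")) // 200
}
```

The sketch exercises the handlers in-process with `httptest` so it runs without binding a port; a real server would pass the mux to `http.ListenAndServe` on the configured port.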

2. Configuration Changes (`config/config.go`)

Added healthcheck port configuration:

// HealthCheckPort is the port on which the health check HTTP server listens
// Defaults to 9090
HealthCheckPort *uint32 `json:"healthCheckPort"`

The healthcheck port can be customized via ConfigMap:

  CALICOVPP_INITIAL_CONFIG: |-
    {
      "healthCheckPort": 9090
    }
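
A pointer field such as `HealthCheckPort *uint32` lets the agent tell "not set" apart from an explicit value. The following sketch shows one way the default could be applied, assuming the config is decoded with `encoding/json`. The `initialConfig` struct, `healthCheckPort` function, and `defaultHealthCheckPort` constant are hypothetical names that mirror only the field shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultHealthCheckPort matches the default documented in the field comment.
const defaultHealthCheckPort uint32 = 9090

// initialConfig is a hypothetical struct carrying only the new field.
type initialConfig struct {
	HealthCheckPort *uint32 `json:"healthCheckPort"`
}

// healthCheckPort decodes raw and falls back to the default when the key is
// absent or null, which leaves the pointer nil.
func healthCheckPort(raw string) (uint32, error) {
	var cfg initialConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return 0, fmt.Errorf("parsing initial config: %w", err)
	}
	if cfg.HealthCheckPort == nil {
		return defaultHealthCheckPort, nil
	}
	return *cfg.HealthCheckPort, nil
}

func main() {
	inputs := []string{
		`{}`,                          // key absent: default applies
		`{"healthCheckPort": 9191}`,   // explicit override
		`{"healthCheckPort": 9090,}`,  // trailing comma: rejected by encoding/json
	}
	for _, raw := range inputs {
		port, err := healthCheckPort(raw)
		fmt.Println(port, err)
	}
}
```

Note that `encoding/json` is strict about syntax, so a trailing comma after the last key makes the whole document fail to parse rather than being silently ignored.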

3. Deployment YAML Changes (`yaml/base/calico-vpp-daemonset.yaml`)

Added Kubernetes health probes to agent container:

  startupProbe:
    failureThreshold: 10
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 30
    timeoutSeconds: 3

  livenessProbe:
    failureThreshold: 3
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 3

  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /readiness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 3

Components Tracked

The health system tracks the initialization of these components:

vpp: VPP connection established
vpp-manager: VPP Manager ready
felix: Felix configuration received
agent: Agent fully initialized and running

Monitoring

The /status endpoint provides detailed information about the healthcheck status. Here is an example status response:

{
  "healthy": true,
  "ready": true,
  "components": {
    "agent": {
      "initialized": true,
      "message": "Agent fully initialized and running",
      "updatedAt": "2024-10-15T22:30:00Z"
    },
    "felix": {
      "initialized": true,
      "message": "Felix config received",
      "updatedAt": "2024-10-15T22:29:45Z"
    },
    "vpp": {
      "initialized": true,
      "message": "VPP connection established",
      "updatedAt": "2024-10-15T22:29:30Z"
    },
    "vpp-manager": {
      "initialized": true,
      "message": "VPP Manager ready",
      "updatedAt": "2024-10-15T22:29:35Z"
    }
  },
  "message": "All components initialized",
  "lastUpdate": "2024-10-15T22:30:00Z"
}
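
For monitoring scripts, the response can be decoded into typed structs. The sketch below assumes only the JSON shape shown in the example; the Go type names, the `uninitialized` helper, and the partially initialized sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// componentStatus mirrors one entry under "components".
type componentStatus struct {
	Initialized bool      `json:"initialized"`
	Message     string    `json:"message"`
	UpdatedAt   time.Time `json:"updatedAt"`
}

// statusResponse mirrors the top-level /status document.
type statusResponse struct {
	Healthy    bool                       `json:"healthy"`
	Ready      bool                       `json:"ready"`
	Components map[string]componentStatus `json:"components"`
	Message    string                     `json:"message"`
	LastUpdate time.Time                  `json:"lastUpdate"`
}

// uninitialized returns the sorted names of components whose "initialized"
// flag is false.
func uninitialized(body []byte) ([]string, error) {
	var st statusResponse
	if err := json.Unmarshal(body, &st); err != nil {
		return nil, fmt.Errorf("decoding status: %w", err)
	}
	var out []string
	for name, c := range st.Components {
		if !c.Initialized {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	// Hypothetical payload for an agent that is still waiting for Felix.
	body := []byte(`{
	  "healthy": false,
	  "ready": false,
	  "components": {
	    "vpp":   {"initialized": true,  "message": "VPP connection established", "updatedAt": "2024-10-15T22:29:30Z"},
	    "felix": {"initialized": false, "message": "Waiting for Felix configuration", "updatedAt": "2024-10-15T22:29:45Z"}
	  },
	  "lastUpdate": "2024-10-15T22:29:45Z"
	}`)
	names, err := uninitialized(body)
	fmt.Println(names, err) // [felix] <nil>
}
```

The timestamps in the example are RFC 3339 strings, which `time.Time` decodes directly.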

Signed-off-by: Aritra Basu <[email protected]>

hedibouattour

Thanks for this change, great !
If I understand correctly this is never gonna timeout and crash ? So if felix doesn't send its config at all we are just stuck at notReady state ?

hedibouattour · 2025-10-23T12:00:23Z

calico-vpp-agent/cmd/calico_vpp_dataplane.go

+	healthServer.MarkAsUnhealthy("Waiting for Felix configuration")
+	log.Info("Waiting for Felix configuration...")
+
+	ticker := time.NewTicker(20 * time.Second)


I think we can consider reducing the interval; 20s might be too long for retries

aritrbas · 2025-10-23T15:33:29Z

> If I understand correctly this is never gonna timeout and crash ? So if felix doesn't send its config at all we are just stuck at notReady state ?

At startup, Kubernetes runs the startupProbe every 30 seconds after an initial delay of 30s. It gives up after 10 consecutive failures, i.e., after roughly (30 + 10×30) = 330 seconds. So, if the probe never succeeds, Kubernetes restarts the container after about 5½ minutes.
Once the startupProbe succeeds, the livenessProbe takes over. If it fails 3 times in a row (each 10s apart), Kubernetes restarts the container (≈30s of failure).
The readinessProbe controls whether the pod is Ready to receive traffic. If it never succeeds, the pod stays in the NotReady state (i.e., it is not added to service endpoints), but it is not restarted.

So, if Felix doesn't send its config, the startupProbe causes a restart about every 5½ minutes until the config is received. Once the config is received and the startupProbe succeeds, the container is restarted only if the livenessProbe fails (which can now only happen if the agent crashes or the goroutine managing the health server goes down).

added healthcheck support for agent

b43056f

Signed-off-by: Aritra Basu <[email protected]>

aritrbas force-pushed the abasu-agent-healthcheck branch from ba97a70 to b43056f on October 21, 2025 22:51

sknat requested review from hedibouattour and sknat October 22, 2025 07:59

sknat assigned aritrbas Oct 22, 2025

aritrbas requested review from florincoras and onong October 22, 2025 22:35

hedibouattour reviewed Oct 23, 2025
